fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348
Draft
AztecBot wants to merge 1 commit into
Draft
fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348AztecBot wants to merge 1 commit into
AztecBot wants to merge 1 commit into
Conversation
…teardown Re-fix the flaky 'ProvingBroker > Retries > does not retry if job is stale' test (persisted variant) that dequeued PR #23344 from the merge queue. The failure surfaces as Error('Store is closed') from BatchQueue.execProcessor -> KVBrokerDatabase.commitWrites -> SingleEpochDatabase.batchWrite -> store.transactionAsync, triggered by a race between the broker's epoch cleanupPass and the persisted DB's in-flight batched writes. Two small, independent defenses: - proving_broker.ts: in #reportProvingJobError and #reportProvingJobSuccess, after the existing !item early-return, also drop the report when isJobStale(item). A late report for a job whose epoch has been (or is being) cleaned up never reaches the database. - proving_broker_database/persisted.ts: track deletedEpochs so commitWrites short-circuits for torn-down epoch directories; reorder deleteAllProvingJobsOlderThanEpoch to remove from the epochs map before await db.delete(); catch a residual 'Store is closed' inside commitWrites as a benign drop. Verified locally: 5 consecutive runs of proving_broker.test.ts pass, 102/102 tests each, including both in-memory and persisted variants of the previously flaky 'does not retry if job is stale'.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-fix the flaky
ProvingBroker > Retries > does not retry if job is stale(persisted variant) that caused PR #23344 to be dequeued from the merge queue (failing run, CI log key64a972aafaa40dd0)..test_patterns.ymlpreviously suppressed this exact failure; #23260 removed the suppression after cleaning up in-memory state, but didn't address the underlying race between the broker's epoch cleanup and the LMDB-backed batched writes, so the flake recurred. The failure surfaces asError('Store is closed')thrown out ofBatchQueue.execProcessor → KVBrokerDatabase.commitWrites → SingleEpochDatabase.batchWrite → store.transactionAsync.Fix
Two small, independent defenses:
proving_broker.ts— in both#reportProvingJobErrorand#reportProvingJobSuccess, after the existing!itemearly-return, also drop the report whenthis.isJobStale(item). A late report for a job whose epoch has been (or is being) cleaned up never reaches the database.proving_broker_database/persisted.ts:deletedEpochs: Set<number>socommitWritesshort-circuits for epoch directories that are being torn down.deleteAllProvingJobsOlderThanEpochadds todeletedEpochsfirst, then removes from theepochsmap before awaitingdb.delete()(which synchronously marks the store closed and then awaitsrm). This closes the window where a racingcommitWritescould fetch a still-mapped, already-closed store.Error('Store is closed')insidecommitWritesas a benign drop. Any other error still propagates.Verification
yarn workspace @aztec/prover-client test src/proving_broker/proving_broker.test.ts— 5 consecutive runs, 102/102 tests passing on each, including both in-memory and persisted variants ofdoes not retry if job is stale.Full analysis (failure timeline, the two races, why both defenses are needed): https://gist.github.com/AztecBot/612364332ec5a062bcf88ded9177334e
Test plan
does not retry if job is stalepasses deterministically in CI under merge-queue-heavy concurrencyproving_broker.test.ts(retries, timeouts, database management blocks) — verified locallyCreated by claudebox — group:
slackbotClaudeBox log: https://claudebox.work/s/731f628c79243769?run=1